Conversation

@kalleep kalleep commented Dec 4, 2025

PR Description

During a hackathon I decided to explore a new API for our internal / vendored tail package.

The package had questionable locking, and it is really hard to determine whether it was done correctly. The design also spawned 2 goroutines per tailed file, and everything was communicated over a channel.

So the design I came up with has a much smaller API surface, with only two public methods: Next and Stop.

Next will return the next line of the file, or block until either a file event occurs or the file is stopped.
Stop will cancel any blocked calls to Next and close the file.

I also added Offset to Line; it is the offset into the file right after the line we just consumed. This allows us to remove an additional goroutine in tailer.go: the background job updating positions and metrics. We can instead do this during the read loop, still respecting the interval for updates.

I ran two Alloy collectors with the same config, tailing 20 files:

2025-12-04-11:52:59

This will drastically reduce the number of goroutines, and the memory overhead that comes with them, for Alloy instances tailing many files.

I also removed the usage of loki.EntryHandler and set labels directly on the entry we send; this removes one additional goroutine.

So in total it is 4 fewer goroutines per tailed file.

Which issue(s) this PR fixes

Notes to the Reviewer

  • Watcher was refactored into two functions instead: one that blocks until a file exists or the context is canceled, and one that blocks until a file event is detected. There is no need to run these as background jobs and send events through channels.

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@kalleep kalleep requested a review from a team as a code owner December 4, 2025 10:56

@kalleep kalleep left a comment


Not sure what I should put in the changelog, any suggestions?

@kalleep kalleep added publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository and removed publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository labels Dec 4, 2025
@kalleep kalleep requested a review from Copilot December 4, 2025 14:17

Copilot AI left a comment


Pull request overview

This PR refactors the internal tail package to simplify its API and reduce resource usage. The refactoring eliminates 4 goroutines per tailed file by replacing the channel-based design with direct method calls (Next() and Stop()), consolidating position updates into the main read loop, and removing the loki.EntryHandler wrapper. The new design reduces memory overhead and complexity while maintaining all existing functionality.

Key Changes:

  • Simplified tail package API from channel-based to blocking method calls with Next() and Stop() methods
  • Added Offset field to Line struct to enable inline position tracking without a separate goroutine
  • Consolidated file watching logic into standalone blocking functions instead of background goroutines

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal/component/loki/source/file/tailer.go Refactored to use new tail API, removed separate position update goroutine, merged label handling into entry creation
internal/component/loki/source/file/tailer_test.go Updated test to accommodate new tailer startup behavior
internal/component/loki/source/file/internal/tail/file.go New implementation of file tailing with blocking Next() method
internal/component/loki/source/file/internal/tail/file_test.go Comprehensive test suite for new File implementation
internal/component/loki/source/file/internal/tail/line.go Added Line struct with Offset field for position tracking
internal/component/loki/source/file/internal/tail/config.go New configuration structures for File and WatcherConfig
internal/component/loki/source/file/internal/tail/block.go Standalone blocking functions for file existence and event detection
internal/component/loki/source/file/internal/tail/fileext/file_*.go Consolidated and renamed from winfile package, includes platform-specific file operations
internal/component/loki/source/file/internal/tail/watch/*.go Removed old watch package with channel-based file watching
internal/component/loki/source/file/internal/tail/tail.go Removed old channel-based tail implementation
internal/component/loki/source/file/internal/tail/util/util.go Removed unused utility package
internal/component/loki/source/file/internal/tail/README.md Removed outdated README referencing old implementation
Comments suppressed due to low confidence (1)

internal/component/loki/source/file/internal/tail/fileext/file_windows.go:104

  • Typo in comment: "ended information" should be "extended information" to match the description above.

Comment on lines 204 to 209
MinBackoff: defaultWatcherConfig.MaxPollFrequency,
MaxBackoff: defaultWatcherConfig.MaxPollFrequency,

Copilot AI Dec 4, 2025


The backoff configuration uses defaultWatcherConfig.MaxPollFrequency instead of f.cfg.WatcherConfig.MaxPollFrequency. This means that even if the user configured custom polling frequencies, the reopen logic will always use the default values (250ms). This is inconsistent with the rest of the file watching behavior and could lead to unexpected delays or polling frequencies during file reopens. Consider using f.cfg.WatcherConfig.MaxPollFrequency instead.

Suggested change
MinBackoff: defaultWatcherConfig.MaxPollFrequency,
MaxBackoff: defaultWatcherConfig.MaxPollFrequency,
MinBackoff: f.cfg.WatcherConfig.MaxPollFrequency,
MaxBackoff: f.cfg.WatcherConfig.MaxPollFrequency,

Contributor


I'm confused, AI's comment seems valid here, can you leave a comment to explain?


@kalleep kalleep Dec 10, 2025


So the original implementation had this too: https://github.com/grafana/alloy/blob/main/internal/component/loki/source/file/internal/tail/tail.go#L229

So I just kept the same behavior, but we can for sure change it.

@kalleep kalleep added publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository and removed publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository labels Dec 4, 2025
@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 9d7217a to 1fa3763 Compare December 5, 2025 09:02
@kalleep kalleep added os:windows publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository and removed os:windows publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository labels Dec 5, 2025
@kalleep kalleep marked this pull request as draft December 5, 2025 10:35
@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 4d413e9 to 1fa3763 Compare December 5, 2025 10:53
@kalleep

kalleep commented Dec 5, 2025

Deployed this to our biggest internal cluster:
2025-12-05-14:05:00

@kalleep kalleep added publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository and removed publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository labels Dec 8, 2025
@kalleep kalleep marked this pull request as ready for review December 8, 2025 09:06

line = strings.TrimRight(line, "\n")
// Trim Windows line endings
line = strings.TrimSuffix(line, "\r")
Contributor


Why not just TrimRight with "\r\n"? TrimRight removes all of the trailing characters that are in the set, so it will strip either one. The current behavior only trims a single trailing CRLF pair on Windows but all trailing newlines on non-Windows.

Contributor Author


This is what was there before, but you are right, we can just use strings.TrimRight(line, "\r\n") 👍

@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 27d7fc2 to 12dcebf Compare December 10, 2025 08:45
@kalleep kalleep requested a review from dehaansa December 10, 2025 08:53
@github-actions

github-actions bot commented Dec 10, 2025

🔍 Dependency Review

gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 -> removed — ✅ Safe

Summary:

  • The direct dependency on gopkg.in/tomb.v1 has been removed.
  • The codebase has already been migrated away from tomb to native context-based cancellation and a backoff loop (github.com/grafana/dskit/backoff).
  • No additional code changes are required to adopt this dependency removal.

Why this is safe:

  • All previous usages of tomb.Tomb within the forked tail package have been removed and replaced with context-based control flows.
  • The tail logic has been refactored into a smaller, self-contained tailer that polls with backoff using a provided context and no longer relies on the tomb lifecycle.

Key migrations observed in this PR:

  • tomb-driven watcher and polling logic removed:
    • internal/component/loki/source/file/internal/tail/watch/* removed
    • internal/component/loki/source/file/internal/tail/tail.go removed (which used tomb.Tomb)
  • Replaced with context and backoff:
    • New block polling: internal/component/loki/source/file/internal/tail/block.go
    • New File tailer API using context and dskit/backoff: internal/component/loki/source/file/internal/tail/file.go
    • Windows file handling consolidated under fileext with delete-pending detection

Relevant snippets (before vs after):

  • Old (tomb used in tail loop and watcher):
    // tail.go (removed)
    type Tail struct {
        ...
        tomb.Tomb // provides: Done, Kill, Dying
    }
    
    func (tail *Tail) tailFileSync() {
        defer tail.Done()
        defer tail.close()
    
        for {
            ...
            select {
            case <-tail.changes.Modified:
            case <-tail.Dying():
                return
            }
        }
    }
    
  • New (context + backoff):
    // block.go (new)
    backoff := backoff.New(ctx, backoff.Config{
        MinBackoff: cfg.WatcherConfig.MinPollFrequency,
        MaxBackoff: cfg.WatcherConfig.MaxPollFrequency,
    })
    ...
    for backoff.Ongoing() {
        ...
        backoff.Wait()
    }
    
    // file.go (new)
    ctx, cancel := context.WithCancel(context.Background())
    ...
    func (f *File) Next() (*Line, error) {
        if f.ctx.Err() != nil {
            return nil, f.ctx.Err()
        }
        ...
        if errors.Is(err, io.EOF) {
            if err := f.wait(text != ""); err != nil {
                return nil, err
            }
            goto read
        }
    }
    
    func (f *File) Stop() error {
        f.cancel()
        f.mu.Lock()
        defer f.mu.Unlock()
        return f.file.Close()
    }
    

If you need to remove tomb usage elsewhere in your codebase, here’s a concise migration pattern:

  • Replace a tomb-driven goroutine lifecycle with context cancellation.
  • Replace tomb.Dying() checks with <-ctx.Done() or checking ctx.Err().
  • Return context.Canceled when stopping to mirror previous stop semantics.

Example migration diff:

- type Worker struct {
-   Tomb tomb.Tomb
- }
+ type Worker struct {
+   ctx    context.Context
+   cancel context.CancelFunc
+ }

- func (w *Worker) Start() {
-   w.Tomb.Go(func() error {
-     for {
-       select {
-       case <-w.Tomb.Dying():
-         return nil
-       default:
-         // work
-       }
-     }
-   })
- }
+ func (w *Worker) Start() {
+   w.ctx, w.cancel = context.WithCancel(context.Background())
+   go func() {
+     defer func() { /* cleanup */ }()
+     for {
+       select {
+       case <-w.ctx.Done():
+         return
+       default:
+         // work
+       }
+     }
+   }()
+ }

- func (w *Worker) Stop() error {
-   w.Tomb.Kill(nil)
-   return w.Tomb.Wait()
- }
+ func (w *Worker) Stop() error {
+   if w.cancel != nil {
+     w.cancel()
+   }
+   return nil
+ }

No changes to imports are necessary beyond removing gopkg.in/tomb.v1 (imports that still reference tomb would fail to compile, but the current codebase no longer contains such imports).

Notes

  • No other direct dependency version changes were detected in go.mod besides the removal of gopkg.in/tomb.v1.
  • New polling/backoff behavior uses github.com/grafana/dskit/backoff (already present in the repository outside of go.mod diff scope).

@kalleep

kalleep commented Dec 10, 2025

The test that fails on Windows is not related to this PR and seems to be broken on main.

@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 3c6c2a8 to 1b6fdf8 Compare December 10, 2025 13:41
@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 1b6fdf8 to 9f08d05 Compare December 10, 2025 14:20

@ptodev ptodev left a comment


Do you think that with fewer goroutines there is a higher risk of the pipeline being blocked? Would it be more likely to experience slowness due to the larger number of synchronous calls? E.g.:

  • If the reading is slow
  • If the position updating is slow (I know this is unlikely, but probably worth considering)
  • If the sending is slow
